fix: drop updated_at guard from reconcile work-item ack#1149
Conversation
📝 WalkthroughWalkthroughRemoved the optimistic-concurrency ChangesWork Item Deletion Simplification
Sequence Diagram (high-level ack/delete flow): sequenceDiagram
participant Worker
participant Queue
participant PostgresDB
Worker->>Queue: AckSuccess(ItemID, WorkerID)
Queue->>PostgresDB: DELETE ... WHERE id = ItemID AND claimed_by = WorkerID
PostgresDB-->>Queue: ackResult{Deleted: bool}
Queue-->>Worker: ackResult
alt Deleted == false
Worker->>Worker: log error
Worker->>Hooks: OnDropped(error)
else Deleted == true
Worker->>Worker: continue / call OnProcessed
end
🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested reviewers:
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@apps/workspace-engine/test/controllers/harness/helpers.go`:
- Around line 75-78: In DrainQueue, don't ignore the result of queue.AckSuccess:
capture the returned error/result from queue.AckSuccess(ctx,
reconcile.AckSuccessParams{ItemID: item.ID, WorkerID: testWorkerID}) and handle
failures explicitly (e.g., return the error or record/log and stop processing
the item so it can be retried) instead of discarding it; update the DrainQueue
logic to treat non-deleting/failed acks as unsuccessful by checking the error
and acting accordingly so failing acks won't be treated as processed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: ebe4684e-6477-4968-91bf-b4a5f8a40402
📒 Files selected for processing (9)
apps/workspace-engine/pkg/reconcile/memory/memory.goapps/workspace-engine/pkg/reconcile/memory/memory_test.goapps/workspace-engine/pkg/reconcile/postgres/db/reconcile_work_item.sql.goapps/workspace-engine/pkg/reconcile/postgres/pg.goapps/workspace-engine/pkg/reconcile/postgres/pg_integration_test.goapps/workspace-engine/pkg/reconcile/postgres/queries/reconcile_work_item.sqlapps/workspace-engine/pkg/reconcile/worker.goapps/workspace-engine/pkg/reconcile/workqueue.goapps/workspace-engine/test/controllers/harness/helpers.go
💤 Files with no reviewable changes (1)
- apps/workspace-engine/pkg/reconcile/postgres/pg.go
There was a problem hiding this comment.
Pull request overview
This PR fixes reconcile work-item acknowledgements failing after lease heartbeats by removing the updated_at-snapshot guard from AckSuccess across the reconcile queue API and its implementations (Postgres + in-memory). It also adds a defensive runtime check in the worker to surface unexpected “owned-but-not-deleted” ack results.
Changes:
- Remove
ClaimedUpdatedAtfromreconcile.AckSuccessParamsand update all call sites. - Update Postgres
DeleteClaimedReconcileWorkItemto drop theupdated_atpredicate so lease heartbeats no longer block deletion. - Add worker-side handling/logging when
AckSuccessreturnsDeleted=falsewithout error.
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| apps/workspace-engine/test/controllers/harness/helpers.go | Updates test harness to call AckSuccess without the removed ClaimedUpdatedAt. |
| apps/workspace-engine/pkg/reconcile/workqueue.go | Removes ClaimedUpdatedAt from the public reconcile queue ack params type. |
| apps/workspace-engine/pkg/reconcile/worker.go | Uses ack result and adds a branch for Deleted=false to drop/log rather than marking processed. |
| apps/workspace-engine/pkg/reconcile/postgres/queries/reconcile_work_item.sql | Removes the updated_at guard from the delete-claimed query and updates its documentation. |
| apps/workspace-engine/pkg/reconcile/postgres/pg.go | Stops passing UpdatedAt into the sqlc-generated delete-claimed query. |
| apps/workspace-engine/pkg/reconcile/postgres/pg_integration_test.go | Updates ack calls for the new AckSuccessParams shape. |
| apps/workspace-engine/pkg/reconcile/postgres/db/reconcile_work_item.sql.go | Regenerates sqlc output for the updated delete-claimed query signature and SQL. |
| apps/workspace-engine/pkg/reconcile/memory/memory.go | Removes the in-memory updated-at snapshot check from AckSuccess. |
| apps/workspace-engine/pkg/reconcile/memory/memory_test.go | Updates ack calls for the new AckSuccessParams shape. |
Files not reviewed (1)
- apps/workspace-engine/pkg/reconcile/postgres/db/reconcile_work_item.sql.go: Language not supported
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @@ -137,7 +140,6 @@ WITH target_scope AS ( | |||
| DELETE FROM reconcile_work_scope AS s | |||
| USING target_scope AS t | |||
| WHERE s.id = t.id | |||
| ack, err := queue.AckSuccess(ctx, reconcile.AckSuccessParams{ | ||
| ItemID: item.ID, | ||
| WorkerID: "worker-a", | ||
| ClaimedUpdatedAt: item.UpdatedAt, | ||
| ItemID: item.ID, | ||
| WorkerID: "worker-a", | ||
| }) |
| if !ackResult.Deleted { | ||
| slog.ErrorContext(ctx, | ||
| "ack reported ownership but did not delete row — row will be re-claimed", | ||
| "item", item.ID, | ||
| "kind", item.Kind, |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@apps/workspace-engine/pkg/reconcile/worker_test.go`:
- Around line 522-524: The mock ackFn in the subtest returns
AckSuccessResult{Deleted: false} but leaves Owned as the zero value; update the
mock return to AckSuccessResult{Owned: true, Deleted: false} in the ackFn used
by this subtest so it matches the intended regression shape (Owned=true,
Deleted=false) when invoking the function under test (ackFn with
AckSuccessParams).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7e269e65-0b34-46c3-8820-1c0028c312a7
📒 Files selected for processing (2)
apps/workspace-engine/pkg/reconcile/postgres/pg_integration_test.goapps/workspace-engine/pkg/reconcile/worker_test.go
| ackFn: func(context.Context, AckSuccessParams) (AckSuccessResult, error) { | ||
| return AckSuccessResult{Deleted: false}, nil | ||
| }, |
There was a problem hiding this comment.
Match the mocked ack result to the intended regression shape (Owned=true, Deleted=false).
This subtest targets the Owned=true/Deleted=false case, but the mock currently leaves Owned at its zero value (false), which weakens the regression coverage.
Suggested fix
q := &fakeQueue{
ackFn: func(context.Context, AckSuccessParams) (AckSuccessResult, error) {
- return AckSuccessResult{Deleted: false}, nil
+ return AckSuccessResult{Owned: true, Deleted: false}, nil
},
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| ackFn: func(context.Context, AckSuccessParams) (AckSuccessResult, error) { | |
| return AckSuccessResult{Deleted: false}, nil | |
| }, | |
| ackFn: func(context.Context, AckSuccessParams) (AckSuccessResult, error) { | |
| return AckSuccessResult{Owned: true, Deleted: false}, nil | |
| }, |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@apps/workspace-engine/pkg/reconcile/worker_test.go` around lines 522 - 524,
The mock ackFn in the subtest returns AckSuccessResult{Deleted: false} but
leaves Owned as the zero value; update the mock return to
AckSuccessResult{Owned: true, Deleted: false} in the ackFn used by this subtest
so it matches the intended regression shape (Owned=true, Deleted=false) when
invoking the function under test (ackFn with AckSuccessParams).
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
apps/workspace-engine/pkg/reconcile/postgres/queries/reconcile_work_item.sql (1)
128-149:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winMirror this row-lock ownership check into the other claim-mutating queries.
This fixes the snapshot-to-row-lock race for
DeleteClaimedReconcileWorkItem, butAckPermanentlyFailedReconcileWorkItemandRetryReconcileWorkItembelow still resolve ownership intarget_scopeand then mutate ons.id = t.idonly. After lease expiry, cleanup, and a re-claim by another worker, those statements can still delete or release a row the caller no longer owns.Proposed follow-up
-- name: AckPermanentlyFailedReconcileWorkItem :one WITH target_scope AS ( SELECT s.id FROM reconcile_work_scope AS s WHERE s.id = sqlc.arg(id) AND s.claimed_by = sqlc.arg(claimed_by) ), deleted_scope AS ( DELETE FROM reconcile_work_scope AS s USING target_scope AS t WHERE s.id = t.id + AND s.claimed_by = sqlc.arg(claimed_by) RETURNING s.id ) SELECT EXISTS (SELECT 1 FROM target_scope) AS owned, EXISTS (SELECT 1 FROM deleted_scope) AS deleted; @@ -- name: RetryReconcileWorkItem :one WITH target_scope AS ( SELECT s.id FROM reconcile_work_scope AS s WHERE s.id = sqlc.arg(id) AND s.claimed_by = sqlc.arg(claimed_by) ), released_scope AS ( UPDATE reconcile_work_scope AS s SET attempt_count = s.attempt_count + 1, last_error = sqlc.arg(last_error), not_before = now() + make_interval(secs => sqlc.arg(retry_backoff_seconds)::int), claimed_by = NULL, claimed_until = NULL, updated_at = now() FROM target_scope AS t - WHERE s.id = t.id + WHERE s.id = t.id + AND s.claimed_by = sqlc.arg(claimed_by) RETURNING s.id ) SELECT EXISTS (SELECT 1 FROM target_scope) AS owned, EXISTS (SELECT 1 FROM released_scope) AS released;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/workspace-engine/pkg/reconcile/postgres/queries/reconcile_work_item.sql` around lines 128 - 149, The other claim-mutating SQL statements (AckPermanentlyFailedReconcileWorkItem and RetryReconcileWorkItem) currently resolve ownership only in the target_scope CTE and then mutate rows using s.id = t.id, leaving a snapshot-to-row-lock race; update those queries to mirror the pattern used in DeleteClaimedReconcileWorkItem: add/keep a WITH target_scope AS (...) selecting id WHERE id = sqlc.arg(id) AND claimed_by = sqlc.arg(claimed_by), then perform DELETE/UPDATE using target_scope AS t and include AND s.claimed_by = sqlc.arg(claimed_by) in the WHERE so the claimed_by is re-checked while the row is locked (apply the same deleted_scope/RETURNING pattern for consistency).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In
`@apps/workspace-engine/pkg/reconcile/postgres/queries/reconcile_work_item.sql`:
- Around line 128-149: The other claim-mutating SQL statements
(AckPermanentlyFailedReconcileWorkItem and RetryReconcileWorkItem) currently
resolve ownership only in the target_scope CTE and then mutate rows using s.id =
t.id, leaving a snapshot-to-row-lock race; update those queries to mirror the
pattern used in DeleteClaimedReconcileWorkItem: add/keep a WITH target_scope AS
(...) selecting id WHERE id = sqlc.arg(id) AND claimed_by =
sqlc.arg(claimed_by), then perform DELETE/UPDATE using target_scope AS t and
include AND s.claimed_by = sqlc.arg(claimed_by) in the WHERE so the claimed_by
is re-checked while the row is locked (apply the same deleted_scope/RETURNING
pattern for consistency).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7a3b0eda-b791-46c2-b022-9ca8a90c6e9f
📒 Files selected for processing (3)
apps/workspace-engine/pkg/reconcile/postgres/db/reconcile_work_item.sql.goapps/workspace-engine/pkg/reconcile/postgres/queries/reconcile_work_item.sqlapps/workspace-engine/test/controllers/harness/helpers.go
DeleteClaimedReconcileWorkItem required s.updated_at <= the
row's updated_at captured at claim time. That guard was meant
to detect concurrent modifications, but in practice the worker's
own lease heartbeat (every 5s) advanced updated_at, invalidating
the worker's own delete whenever Process() ran longer than one
heartbeat tick.
Effect: AckSuccess returned Owned=true, Deleted=false silently;
the row persisted and was re-claimed on every subsequent worker
poll, with attempt_count never incrementing. Items in prod
cycled this way for 5+ days.
Fix: drop the staleness guard. claimed_by = me is the actual
ownership invariant, and no other writer touches a claimed row
in this codebase. Mirror the change in the in-memory queue.
Also surface Owned=true/Deleted=false as an explicit error log
plus OnDropped hook in worker.processClaimedItem, so any future
regression of this class is visible immediately rather than
silently leaking items.
Summary by CodeRabbit
Bug Fixes
Refactor
Tests